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The likelihood ratio (LR) is largely used to evaluate the relative 
weight of forensic data regarding two hypotheses and for its assess¬ 
ment Bayesian methods are widespread in the forensic field. How¬ 
ever, the Bayesian ‘recipe’ for the LR presented in most of literature 
consists in plugging-in Bayesian estimates of the involved nnisance 
parameters into a frequentist-defined LR: frequentist and Bayesian 
methods are thus mixed, giving rise to solutions obtained by hybrid 
reasoning. This paper provides the derivation of a proper Bayesian 
approach to assess LR for the ‘rare type match problem’, the sit¬ 
uation in which the expert wants to evalnate a match between the 
profile of a suspect and that of a trace from the crime scene, and 
this profile has never been observed before in the database of refer¬ 
ence. Bayesian LR assessment using the two most popular Bayesian 
models (beta-binomial and Dirichlet-multinomial) is discussed and 
compared to corresponding plug-in versions. 

Key words: Bayesian plug-in, evidence evaluation, rare type match, Y chromosome STR, 
beta binomial model, Dirichlet-multinomial model, hierarchical bayesian model. 

1 Introduction 

One of the main challenges of forensic science is that of properly evaluating the match 
between the characteristics of a crime stain (for instance a DNA profile) and the cor¬ 
responding characteristics of some material from a known source (for instance from a 
suspect). Typically, a couple of mutually exclusive hypotheses is defined, of the kind of 
‘the crime stain came from the suspect’ (hp) and ‘the crime stain came from an unknown 
donor’ (hd)- 
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The forensic expert is given some data D which can typically be split into evidence, 
data directly related to the crime, and background, additional data not directly related 
to the crime and only pertaining some nuisance parameter 9 involved in the assessment 
of the likelihood ratio. Evidence and background data will be modeled through random 
variables E and B respectively. In particular, we are interested in the situation in which 
the forensic expert is asked to evaluate the match between a DNA profile of a suspect 
and the DNA profile of a stain found at the crime scene. It is intuitive to understand 
that (one of) the nuisance parameter(s) involved in this evaluation is the proportion of 
people with the same profile among the possible perpetrators; the more this profile is 
rare the more the suspect is in trouble. This parameter is unknown and thus the expert 
is given (or asks for) a database containing a list of DNA profiles from a sample from 
the population of possible perpetrators. The main difference between the frequentist and 
the Bayesian methodology is that the first considers the nuisance parameter 9 and the 
correct hypothesis h as fixed (without distribution) unknown quantities, while the second 
models the expert’s uncertainty about their value through random variables, whose prior 
distributions reflect prior expert’s beliefs. 

The largely accepted method for evaluating the data in order to discriminate between 
the two hypotheses of interest, is the calculation of the Bayes factor (BE), regularly called 
in forensic context likelihood ratio (LR) and defined as the ratio of the probability of 
observing the data under the two competing hypotheses: 

Bv{E = e,B = b\H = hp) 

Ft{E = e,B = b\ H = hd) 

In the Bayesian framework (the one of interest for this paper) Pr is the joint distribution 
of all the random variables in the model {E, B, H, and 0). 

On the other hand, frequentists, which consider 9 and h as fixed quantities, use a 
different probability (here denoted as 'Pr) which can be expressed in terms of the Bayesian 
Pr, in the following way: Pr{-) := Tr®(-) = Pr(- | Q = 9,H = h). Thus, the frequentist 
likelihood ratio (denoted as L'lQ is defined as 

= e, B = b) _ = e,B = b\ e = 9,H = hp) 

“ PtI^{E = e,B = b) ~ Fj:{E = e,B = b\ Q = 9,H = hd)' 

Depending on the preferences of the expert, frequentist or Bayesian likelihood ratios can 
be used for the evaluation of forensic data. Once a choice has been made, it is important to 
be consistent with it, while literature often mixes up the two. To our knowledge, this paper 
and (Cereda, 2015b) constitute the only forensic literature that discuss the differences 
between the two approaches. (Cereda, 2015b) is concerned with the theoretical foundations 
of frequentist solutions, while this paper provides a simple and careful derivation of the 
proper Bayesian LR, for the rare type match problem (described in Section 2): the situation 
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in which the DNA profile of the crime stain and that of the suspect match but they are not 
among the DNA profiles observed in the reference database. In Section 3 we will discuss 
the fact that influential Bayesian forensic literature (Weir, 1996; Aitken and Taroni, 2004; 
Taroni et ah, 2010, 2014; Sjerps et ah, 2015) seems to suggest the use of frequentist 
defined likelihood ratio {LIQ and use Bayesian methodologies only inasmuch they provide 
a Bayesian estimate of 6 to be plugged into Others (Curran, 2005; van der Bout 
and Alberink, 2015), treat the likelihood ratio as function of 9 and provide its posterior 
distribution with respect to the posterior distribution of 9 given the data. However, one 
of the main points of discussion is that there is no need of such hybrid derivations since 
the proper Bayesian LR is often very easy to obtain: this paper shows how this should be 
done, taking advantage of a very useful Lemma, presented in Section 4. However, for this 
method to be advisable, the Bayesian prior should be chosen in a sensible way, reflecting 
the expert’s opinion, and not by mathematical convenience as often happens. 

The two most common Bayesian models (beta-binomial and Dirichlet-multinomial) 
are discussed in Sections 5 and 6. They are general enough to be applied to different 
kinds of forensic evidence evaluation, but will be here applied to DNA profiles obtained 
using the Y-STR marker system, with the double aim of exploring the performance of the 
conventional Bayesian prior choices for the rare type match case for non autosomal DNA, 
and of showing how a full Bayesian LR is to be dehned and calculated. Sensitivity analysis 
and comparison with proposed hybrid plug-in solutions are carried out. We are not entirely 
satisfied with the performance of classical models for the rare haplotype problem, which 
we believe would need different kinds of prior, more realistic and tuneable, such as those 
proposed in Cereda (2015c). 

1.1 Notation 

Throughout the paper the following notation is chosen: random variables and their values 
are denoted, respectively, with uppercase and lowercase characters: x is a specific realisa¬ 
tion of X. Random vectors and their values are denoted, respectively, by uppercase and 
lowercase bold characters: p is a realisation of the random vector P. Bayesian probability 
is denoted with Pr(-), while density of a continuous random variable X is denoted by 
p{x). For a discrete random variable Y, the continuous notation p{y) and the discrete one 
Pr(y = y) will be alternately used. Frequentist probability will be denoted as 2’r. 

2 The rare type match problem 

The DNA sequence of an individual is a very long sequence of letters (each corresponding 
to 4 nucleotides) which code the genetic instruction necessary for the life of the individual. 
The entire sequence is unique to each individual (with the only exception of homozygous 
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twins, which share the same sequence), but a Y-STR DNA profile (also called haplotype) 
usually consists on a short list of integers (typically 7 to 23) that describes only certain 
characteristics of the DNA sequence of the individual on the Y chromosome (Gill et ah, 
2001). Moreover, the Y-STR profile is shared between men in the same patrilineal lin¬ 
eage. For these reasons there is a positive probability that two different persons share the 
same profile. This is why we need to weight how probable is the observed match under 
the hypothesis that the suspect left the stain against how likely is the match under the 
hypothesis that someone else left the stain. Glearly, assuming that a match is always 
detected correctly (no false positives), the first probability is 1, and the second depends on 
the proportion 9 of that profile among the population of potential perpetrator. Moreover, 
we are given a list of profiles from a sample of individuals belonging to the population of 
possible perpetrators to assess this frequency. Problems arise when the observed frequency 
of this characteristic is 0, the so-called ‘rare type match problem’. This problem is partic¬ 
ularly significant in case a new kind of forensic evidence for which the available database 
size is still limited is involved. This is the case, for instance, when using DIP-STR markers 
(Cereda et ah, 2014)). The same happens when Y-chromosome (or mitochondrial) DNA 
profiles are used, since the set of possible haplotypes is extremely large, and the coverage 
of available databases often limited. The case of Y-STR DNA will thus be retained here as 
an extreme but in practice common and important way in which the problem of assessing 
the evidential value of rare type match can arise. This is a very appropriate and paradig¬ 
matic example, since literature provides examples of different approaches to evaluate the 
evidential value of rare Y-STR profile match (Roewer et ah, 2000; Andersen et ah, 2013, 
e.g.), even though, in our opinion, a proper Bayesian derivation for the LR in rare type 
match case hasn’t been proposed yet. This problem is so substantial that it has been de¬ 
fined “the fundamental problem of forensic mathematics” by Brenner (2010). We will now 
review some of the methods proposed by literature. Most of them address the problem of 
assessing the frequency of a type with zero occurrence, sometimes under the name of ‘zero 
numerator problem’ (e.g. Winkler et ah, 2002). Notice that this is related, but not equiv¬ 
alent, to the problem of assessing the likelihood ratio in case of a rare type match. The 
empirical frequency estimator, also called naive estimator, that uses the frequency of the 
characteristic in the database, puts unit probability mass on the set of already observed 
characteristics, and it is thus unprepared for the observation of a new type. A solution 
could be the add-constant estimators (in particular the well known add-one estimator, due 
to Laplace (1814), and the add-half estimator of Krichevsky and Trofimov (1981)), which 
add a constant to the count of each type, included the unseen ones. However, this method 
requires to know the number of possible unseen types, and does not perform well when this 
number is large compared to the sample size (see Gale and Ghurch (1994) for additional 
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discussion). Moreover, Louis (1981) proposes the so-called ‘rule of three’, that states that 
if n is the size of the database, 3/n is a good approximation of the 95% upper bound for 
the frequency. This is also proposed in a Bayesian framework, by Jovanovic and Levy 
(1997); Winkler et al. (2002); Chen and McGee (2008). Alternatively, Good (1953), based 
on an intuition on A.M. Turing, proposed the nonparametric Good Turing estimator for 
the total unobserved probability mass, based on the proportion of singleton observations in 
the sample. An extension of this estimator is applied to the LR assessment in the rare type 
match in Cereda (2015b). For a comparison between add one and Good-Turing estimator, 
see Orlitsky et al. (2003). As pointed out in Anevski et al. (2013), the naive estimator, 
and the Good Turing estimator are in some sense complementary: the first gives a good 
estimate for the observed types, and the second for the probability mass of the unobserved 
ones. More recently, Orlitsky et al. (2004) have introduced the high profile estimator, 
which extends the tail of the naive estimator to the region of unobserved types. Anevski 
et al. (2013) improved this estimator and provided the consistency proof. Papers that 
address the rare Y-STR haplotype problem in forensic context are for instance Egeland 
and Salas (2008), Brenner (2010), and Cereda (2015a,c). Moreover, the Discrete Laplace 
method presented in Andersen et al. (2013), even though not specifically designed for the 
rare type match can be successfully applied to that extent (Cereda, 2015b). Bayesian non¬ 
parametric estimators for the probability of observing a new type have been proposed by 
Tiwari and Tripathi (e.g. 1989); Lijoi et al. (e.g. 2007); Favaro et al. (e.g. 2009). However, 
for the likelihood ratio assessment it is required not only the probability of observing a 
new species but also the probability of observing this same species twice (according to the 
defense the crime stain profile and the suspect profile are two independent observations). 
Cereda (2015c) is the first paper that addresses the problem of LR assessment in the rare 
haplotype case using Bayesian nonparametric models. 

3 The full Bayesian approach to LR 

The likelihood ratio assessment often involves some unknown nuisance parameters, denoted 
as 9. In our case, it is the proportion of individuals in the relevant population with Y- 
STR profile corresponding to that of the matching trace, or the entire vector containing 
the frequencies of all the profiles. The parameter of interest, h, is the unknown true 
hypothesis. Available data is made of evidence [E) directly related to the crime, and 
which helps in discriminate h, and additional background data (B) not directly related to 
the crime and only pertaining to the nuisance parameter 9. This is partially different from 
the ‘background information’ I as defined in Aitken and Taroni (2004); Taroni et al. (2014), 
but we can say that often background data can be thought of as part of the background 
information. 
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The difference between Bayesian and frequentist methods consists in how they treat 
the parameters 9 and h. A Bayesian models the uncertainty about their value by random 
variables 0 and H, which are given prior distributions p{9) and p{h). Frequentists consider 
them as hxed (i.e., without distribution) unknown quantities. The reader is invited to 
notice the difference between 0 and h: one is the parameter which we ‘test’ through the 
likelihood ratio (/i), the other [9) is a nuisance parameter involved in its calculation. Some 
assumptions about the conditional independence probability for the model can be made, 
valid both for the frequentist and for the Bayesian approach. 

a. The distribution of B given h and 9, only depends on 9. 

b. R is independent of E, given 9 and h. 

In our DNA example, condition a corresponds to ask that the sampling mechanism to 
obtain the database of reference is independent of which hypothesis is correct. This is 
true if, as it often happens, the database is collected before the crime. Condition b holds 
if the suspect has been found on a ground of different evidence that has nothing to do 
with DNA. In what follows we are going to use Bayesian network notation to specify the 
conditional independence relations of the proposed models. We expect the reader to be 
familiar with such a representation. 

3.1 Bayesian point of view. 



Figure 1: Bayesian network representing the dependency relations between E (evidence of 
the case) B (background data in the form of a database) 0 (population parameter) and 
H (hypotheses of interest). 


Bayesians deal with the uncertainty over the parameters 9 and h by considering their 
values as realisations of, respectively, random variables 0 and H. A full Bayesian model 
is defined when the prior joint probability distribution for all the random variables of the 
model (here E, B, H and 0) is given. This full Bayesian model can be thus represented 
by the Bayesian network of Figure 1, which is in turn equivalent to the following three 
conditions: 

Bayesian a. B is conditionally independent of H given 0. 
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Bayesian b. B is conditionally independent of E given 0 and H. 

Bayesian c. 0 is unconditionally independent of H. 

Notice that they are the Bayesian reformulation of conditions a. and b. mentioned 
above, with an additional condition (Bayesian c.) which corresponds to assuming that 
the Bayesian probability makes 0 and H independent, and is guaranteed for instance if 
prior beliefs on 9 and on h are assessed by people with different responsibilities and tasks: 
a judge for h and a DNA expert (or a statistician) for 9. However, notice that by definition 
the LR is independent of the prior belief over h. 

The structure of the Bayesian network (or, equivalently, the three conditions above) 
allows to factorise the joint prior as p{9,h,b,e) = p{9)p{h)p{b\9)p{e\9, h). The Bayesian 
probability Pr underlying the model is defined accordingly. As all Bayesian probabilities 
it is an expression of the subjective belief of the experts. This is achieved by choosing the 
prior distribution for 9 and h which reflects expert’s beliefs. The distribution of all other 
variables given 9 and h is defined by the model, and need no subjective assessment. 

The Bayesian likelihood ratio can be derived in the following way: 

LR = e, B = b\H = hp) Pr(£' = e\B = b, H = hp) f p{e\b, hp, 9) p{9\b, hp)d9 

Pt{E = e,B = e\H = h^) Pr(i? = e\B = b,H = hj) f p{e\b, hd, 9)p{9\b, hd)d9 

f9p{9\b)d9 E{e\B = b) 

~ f9^p{9\b) d9 ~ E(02|R = b)' 

where some simplifications have been carried out because of conditions a, b, and c. More¬ 
over, the first equality is due to the fact that it is implied by the network structure that 
B is also unconditionally independent of H, and the second to last equality is due to the 
fact that, given condition a it holds that p{9\b, h) = p{9\b), V/i. 

3.2 Prequentist point of view. 

As already mentioned, frequentists consider h and 9 as fixed quantities, whose unknown 
values correspond to, respectively, the true value of 9 and the correct hypothesis. The 
frequentist model can be thus seen as a special case of the Bayesian model described 
in Section 3.1, where 0 and H are given degenerate priors on 9 and h, respectively. 
Alternatively, one can express the frequentist probability (Pr in terms of the Bayesian Pr in 
the following way: (Pr{-) := (Pr^{-) = Pr(-|Lf = h,Q = 9). li the Bayesian Pr was subjective, 
the frequentist (Pr is a measure which is universally defined by Nature. 

Regarding h, according to prosecution its true value is hp, while according to defence 
it is hd- So one can think of two different frequentist probabilities: one for the prosecution 
(^P^^^) and one for the defence (^Pr®^). 

From a frequentist point of view, conditions a and b correspond to ask that: 
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Prequentist a. = b) = = 6), for all 9 and b. 

Prequentist b. 'Prf^{E = e|R = b) = (P = e), for all 6 , h, e, and b. 


Obviously, Bayesian c becomes irrelevant in the frequentist framework. 
The frequentist can be derived as: 


^Pr® (E = e,B = b) 

L^= 


(PrUE = e\B = &)< (P = b) ^tUE = e) 


¥rl{E = e,B = b) Tr^ {E = e\B = h)(Prl {B = b) {E = e 


B (n — 


J / T? — 


( 1 ) 


where the last equality is due to conditions Prequentist a, and Prequentist b. 

Stated otherwise, frequentists look at a value for LR|0 (read “LR given 0”), where the 
value 9 is fixed and has to be estimated through data. 

Through observations, frequentists attempt to get close to the true by choosing 
some estimator One possibility is to estimate 9 with a particular 9. This leads to the 
so-called plug-in estimation = L!K^{9) of the However, that’s not the only option 
(Cereda, 2015b). 

By looking at (1) the reader will realise that, if the frequentist approach is chosen, and 
under conditions a and b, one would get to the same result by evaluating only E or both E 
and B. This means that part of the information, namely B, is not useful to discriminate 
between the two hypotheses of interest (however, it usually plays an important role to 
obtain the estimate 9 to be plugged into the T^). The same does not hold in the Bayesian 
context. 

3.3 The Bayesian plug-in LR and the proper Bayesian LR 

R is now time to discuss the fact that important forensic literature (e.g. Evett and Weir, 
1998; Balding, 2005; Lucy, 2005) considers the likelihood ratio as ‘a measure of the proba¬ 
tive value of the evidence regarding the two hypotheses’ hp and h^. According to this, it 
indicates the extent to which E (and only E) is in favour of one hypothesis over the other. 
This is, in my opinion, the first important problem, since all data at disposal (namely E 
and B) should be evaluated. Even though this is irrelevant in the frequentist framework 
(see (1)), in the Bayesian framework for this definition to be appropriate we need to re¬ 
place the probability Pr with the posterior probability Pr*(-) = Pr(- \ B = b). Indeed, it 
holds that 


LR = 


Pr(E = e,B = b\H = hp) _ Pr(E = e \ B = b, H = hp) _ Fr* {E = e \ H = hp 


Ft{E = e,B = b\ H = hd) Ft{E = e \ B = b, H = hd) Ft*{E = e \ H = hd)’ 
It is as if we have separated the evaluation process in two steps: first we observe B = b, and 
update the probability Pr(-) to the posterior Pr*(-) = Pr(- \ B = b), and then we define 
the likelihood ratio as the ratio of the probabilities (Pr*) of observing (only) the evidence 
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E, under the two alternative hypotheses.^ This point is generally mistaken in literature 
and, as a consequence, the problem is split into two phases: first, a Bayesian estimate of 
0 using B, in the form of a posterior expectation is obtained, and then this estimate is 
plugged into a frequentist defined It is as if, instead of using a combined model such 
as that in Figure 1, they use two separate models as those in Figure 2: the left one is used 
to update the prior over the parameter. The second one is used to derive the likelihood 
ratio (with 9 considered as a fixed quantity). In the end, they replace 9 with the posterior 
expectation of 0 (now modelled through a random) given B. This method will be referred 
to in the paper as the ‘Bayesian plug-in method’, since it is wrongly considered Bayesian, 
but it actually plugs in Bayes estimates into likelihood ratio defined in a frequentist way. 



Figure 2: The two phase approach corresponding to Bayesian plug-in. 


The correct Bayesian approach would be either to evaluate both E and B simultane¬ 
ously, using the network of Figure 1, or in two steps: after the observation of B, we can 
update the model to the one represented in Figure 3, and use this for the evaluation of E. 



Figure 3: Updated Bayesian network after the observation of B. 


3.4 State of the art for DNA match evaluation 

In case of a DNA match, we can use the Bayesian network of Figure 4, which is equivalent to 

the network in Figure 1 with the only difference that here node E is split into two separated 

nodes, Eg and Ec representing the suspect’s and the crime stain’s profile, respectively. We 

^ Often, in literature (Taroni et al., 2014, e.g.), it is explicitly stated that 7, the so-called background 
information, is omitted in the notation. We then agree with this choice provided that B is part of I. 
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denote with 6 the unknown vector made of the population proportions of the different 
DNA profiles in the population, modelled through the random variable 0. Here we assume 
that we know the whole list of different DNA types present in the population of possible 
perpetrators, later we will consider the situation in which we don’t. With 0e^ we will 
denote the population frequency of the suspect’s (and crime stain’s) profile. 



Figure 4: Bayesian network for the DNA example. 


According to the prosecution, the suspect left the stain. This implies that Fi{Ec = 
es\Q = 0, Eg = Ss, H = hp) = 1, under the assumption that each true match is correctly 
reported. According to the defence, another person from the population left the stain, 
hence the probability of it being exactly of type Cc is equal to the population proportion 
of that profile: Pt{Ec = e^l© = 6, Eg = eg,H = hd) = Oes- Moreover, it holds that 
p{b\es, 9) Vi{eg\9)p{9) is proportional to p(0|e<j, h). The correct Bayesian procedure would 
lead to; 

_ Pt{E = e,B = b\H = hp) _ f Pr(Ec = es\H = hp,Q = 6 , Eg = eg)p{eg\9)p{b\9 , es)p{9)d9 
Pt{E = e,B = b\H = hd) f Pr(Ec = es\H = hd,& = 9, Eg = eg)p{eg\9)p{b\9, eg)p{9)d9 
_ Jp{9\eg,b)d9 _ 1 

~^eeMeg,h)de ~ 

On the other hand, the common approach taken by forensic literature would be to 
propose the following derivation for the likelihood ratio 

_ Pr(£' = e\B = b,H = hp) _ PrjEc = es\B = b,Eg = eg,H = hp) 2 i&yhyyy 
Pr(.E = e\B = b, H = hd) Pt{Ec = eg\B = b, Eg = eg,H = 

1 _ 1 
Pt{Ec = eg\H = hd) 9es' 

Then, 9^^ is replaced with 9e^ = 'E{9ejB = b). 

Let us focus on the second to last equality. By looking at the Bayesian network we could 
already realise that actually Ec is independent only if 9 is given. Thus, it is not allowed to 
simplified Eg in the conditioning unless we are considering frequentist probabilities {(Pr). 

Last equality is also incorrect. Indeed, it holds that Pr{Ec = Sg \ B = b, Eg = eg,hd) = 

/ Pr(£’c = eg \ & = 9,Eg = eg, hd)p{9 \ b,eg, hd)d9. 

It is true that, in the end, computationally, the difference amounts on using E(9ejB = 
b,Eg = eg) instead of E{9e^\B = b) (i.e., the well-known problem of whether to add or 
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not the suspect to the database before the posterior) and thus the plug-in can be seen as 
an approximation of the full Bayesian approach. However, it is an hybrid solution, thus 
conceptually ill defined. 

This hybrid approach is often considered Bayesian since the lack of knowledge about 
0 is dealt with using Bayesian posterior mean Oe^ = E(0eg|H) as a point estimate of 
Oes (Weir, 1996; Curran, 2005; Taroni et ah, 2010; Sjerps et ah, 2015). This is why we 
will refer to this way of proceeding as the Bayesian plug-in method. As pointed out in 
Weir (1996), “either the mean or the mode of the posterior distribution can serve as an 
estimate but each is merely a summary of the whole distribution”. Not only this method 
is hybrid and inconsistent, but it suffers from several weaknesses. For instance, one would 
obtain different T^s depending on whether one wants to estimate Of.^, l/^e^ or logio(l/0es): 
this arbitrariness is in some way entailed in the idea of ‘estimating’ the likelihood ratio. 
Moreover, as stated in Taroni et al. (2015), the likelihood ratio (meaning the Bayesian 
one) should be calculated, rather than estimated. Including B as part of the data to 
evaluate, and applying simple Bayesian theory, we can calculate the Bayesian LR, without 
any estimation needed. Notice that already Foreman et al. (1997) and Briimmer and Swart 
(2014) proposed a differentiation between the ‘plug-in estimates’ and the ‘full Bayesian 
analysis’. 

4 A useful Lemma 

Lemma 4.1 is a result regarding four general random variables A, X, Y, H whose condi¬ 
tional dependencies are represented by the Bayesian network of Figure 5. This is important 
due to the possibility of applying it to a very common forensic situation: the prosecution 
and the defence disagree on the the distribution of part of data (Y) but agree on the distri¬ 
bution of the other part (A). The distribution of X and Y depends on some parameter(s) 
modeled by A. 



Figure 5: Conditional dependencies of the random variables of Lemma4.1 


Lemma 4.1. Given four random variables A, H, X and Y, whose conditional dependen¬ 
cies are represented by the Bayesian network of Figure 5, the likelihood function for h, 
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given X = x and Y = y satisfies 

hk(h \ x,y) (X ^{p{y \ x, A, h) \ X = x). 

A proof of this lemma can be found in Cereda (2015b). We will see an application of 
it in Sections 5 and 6. 


5 Bayesian LR calculation, based on beta-binomial model 


In the binomial model, the database of size N is regarded as the result of a sequence 
of N Bernoulli trials with parameter 6, where suecess corresponds to the observation of 
the same type of that observed at the crime scene, and failure to the observation of any 
other type. Let’s denote by b the number of successes among these N experiments. When 
data is treated as a binomial outcome, the most conventional choice of the prior for the 
parameter 6 (probability of success) is the beta distribution, due to the famous conjugacy 
property. In forensic and medical statistics literature there are many examples for the use 
of this distribution for a genetic (autosomal) frequency (Weir, 1996; Gunel and Wearden, 
1995; Roewer et ah, 2000; Brenner, 2010; Buckleton et ah, 2011; Biedermann et ah, 2008, 
2013). 

0 ~ Beta(Q;, f3). 


The observation of the suspect’s profile Eg corresponds to another Bernoulli trial, a success¬ 
ful one in the case of interest (the suspect matches the crime stain type). The information 
provided by the database and the suspect’s type can be reduced to the count of profiles 
of this type in this sample of size + 1 (database and suspect) from the population of 
interest. 

R I 0 = 0 ~ Bin(Ar, 6) 

R,Rs I 0 = 6* ~ Bin(iV-h 1,6») 


Notice that according to the defence Ef. can be seen another Bernoulli experiment of the 
same kind. On the other hand, according to the prosecution it is equal to 1 with probability 
one. Stated otherwise, 


Ec 


Eg = 1, hh = /i ~ 



if if 
if if 


hp 

hd 


The Bayesian network of Figure 4 can be used for this model. Hence, we can apply the 
Lemma 4.1 using X = {B,Eg) (part of data whose distribution is agreed on by defence 
and distribution) and Y = Ec (part of data whose distribution is disagreed on by defence 
and distribution). The LR can thus be developed in the following way: 

_lE(Pr(il/c = l|L/s = 1, H = hp, 0) I Eg = 1, B = b) _ 1 _ Q; + /3-l-iV + l 

^E(Pi{Ec = l\Eg = 1,H = hd,e)\Eg = l,B = b) ^ E(0 \Eg = l,B = b) ^ a + b + l 

( 2 ) 
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The last equality is due to the fact that, using the well known beta binomial conjugacy 
property, it holds that 

Q\B = b,Es = l^ Beta(a + b+ l,l3 + N — b). 

The LR as in (2), also proposed in Dawid and Mortera (1996) and Taroni et al. (2015), 
can be compared to the one obtained with the ‘standard’ Bayesian plug-in estimate (Weir, 
1996; Taroni et ah, 2010): 

+ + T 

a + b 

It is easy to see that the Bayesian plug-in LR is a non conservative estimate of LR, in a 
way that is unfavourable to the defence. Indeed, LR < LR 4^ ^ + N > b, which is always 
true, since b < N and (3 > 0. Notice that there is an alternative derivation for (2). It can 
be obtained in a two step evaluation: first, the observation of the database B and of the 
suspect haplotype Eg updates the probability Pr, then the updated probability is used to 
calculate the LR for the observation of another identical haplotype (the one found at the 
crime scene). 

First step The probability Pr is updated to Pr**(-) = Pr(-|Fs = 1, B = b) after the 
database and the haplotype of the suspect are observed. In practice, the prior 
distribution Beta(a, /3) on 6 is updated to the posterior Beta(Q; -|- 1, /3 -t- IV — 1). 

Second step The new probability Pr** is used to calculate the likelihood ratio for the 
observation of the haplotype from the crime scene: 

^ Pr**(Fe = l|g = hp) ^ 1 ^ 1 

Pr**(Fe=l|F = M E**(0) 


Sensitivity analysis. The sensitivity of the quantities log^o LR, log^o LR, and of the 
difference between them, to the hyperparameters a and [3 of the beta prior is shown in 
Figure 6, for the rare type case (i.e., 6 = 0), and with N = 100. In particular, the figure 
shows the variation of log^oLR (a), of the plug-in estimate logrgLR = log^oLR (b), and 
of the difference log^g LR — logrgLR (c), when different values of a (x axis) and /3 (only 
five values corresponding to the different lines) are chosen in the interval (0,20]. 

Observing Figure 6 (or analysing (2)), it can be seen that the three quantities of interest 
hardly depend on [3, while they decrease as a increases. In particular, when a decreases 
to 0, log;^oLR behaves as logig(l + (3 + N), while log^g LR increases to -|-oo. Another 
way to see this is that, for fixed (3, as a increases, the prior distribution on 6 resembles 
more and more to the degenerate distribution localised on the value 6 = 1 (notice this 
is inappropriate for the rare type match case). This means that the haplotype whose 
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(a): logioLR 


(b); logioLR 


(c): logio LR - logio LR 


Figure 6: Sensitivity analysis for the three quantities logj^g LR (a), log^gLR (b), 
logigLR — logj^gLR (c), in the beta binomial model, when a G (0,20] (x axis), and for 
f5 G {1,5,10,15,20} (corresponding to the different lines, where highest line corresponds 
to highest j3). 


population proportion is modelled through the random variable 0 (i.e., the haplotype 
of the crime stain and of the suspect) has probability one to be observed, which leads 
to LR = 1 (hence, log^g LR = 0). On the other hand, if a decreases to zero, the prior 
distribution over 9 tends to resemble to the degenerate distribution localised on the value 
0 = 0. This leads to LR = 1/0 = +oo. On the whole, the plug-in estimate LR is less 
stable than LR, as can be seen comparing Figures 6 (a) and (b), in the sense that is more 
sensitive to changes in a (especially for small values). The difference, represented in (c) 
has, for fixed /3, a vertical asymptote when a —)■ 0, increasing as fast as log^o 1/a- On the 
other hand it decreases to 0 with an horizontal asymptote when a —)• oo. From Figure 6 

(c) it can be observed that the difference is important only for small values of a. Otherwise 
the two methods would lead essentially to the same conclusions, so that the plug-in can 
be seen as a good approximation of the proper Bayesian procedure. 

6 Bayesian LR calculation, based on Dirichlet-multinomial 
model 

When database is treated as a multinomial sample of size N from a population with k 
different haplotypes, the conventional choice of the prior for the vector 6 containing the 
population frequencies of all the different haplotypes in Nature is the Dirichlet distribu¬ 
tion. Literature provides many examples of the use of this method for the frequencies of 
autosomal markers (Curran et ah, 2002; Balding, 1995; Lange, 1995; Weir, 1996; Buckleton 
and Curran, 2005; Taroni et ah, 2010, e.g.). However, these methods don’t consider the 
uncertainty about the number k of possible types in the population, and this can be a 
problem especially since we want to apply it to Y-STR haplotypes, for which the database 
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often does not offer a good coverage. If, in addition, we are using the model for the rare 
type match case, then we have to find a solution. The problem of estimating A: is a very 
challenging one. It has been addressed both with frequentist methods (Chao and Lee, 
1992; Haas and Stokes, 1998, e.g.) and with Bayesian methods (Hill, 1968; Lewins and 
Joanes, 1984; Barger and Bunge, 2010, e.g.). We propose the derivation of a full Bayesian 
LR which uses priors over the number k of different types in the population. The model is 
represented by the Bayesian network of Figure 7. The bottom part (from node © down) 
has a well-known structure (see Figure 4), while the upper part needs further explanation. 



Figure 7: Bayesian network for Dirichlet-multinomial model, when k is randomized. 


Assume that there may be at most m theoretically possible profiles alphabetically^ 
ordered in a vector, called s. For instance, m = 20^° (10 loci, with 20 possible alleles each). 
Only k of them are actually present in Nature (or more specihcally in the population of 
interest), but k is not known and also which of the m are those k is not known. 

We will denote as K the random variable which represents how many of the m po¬ 
tentially possible haplotypes are actually present in the population of interest. The prior 
distribution for k is denoted generically as p{k). The random vector Type, of length k, 
contains the ordered positions, in vector s, of the k haplotypes of the population of inter¬ 
est. A particular configuration of Type is denoted as t = (zi,..., i^), where ii < ... < z^. t 
is chosen uniformly at random from the possible (™) configurations. The random vector 
© contains the population proportions of all the haplotypes, both those whose position 
is contained in Type, and those that are not (corresponding to zero entries). A partic¬ 
ular conhguration of © is denoted 9 = {6i, ...,9m)-, many entries of which are zero. We 
assume that the positive entries, i.e., (0* | z G t), are drawn from a k dimensional Dirichlet 
^Remember each profile is a list of numbers. 
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distribution with all k hyper parameters a equal to 1. Now, as usual, H represents the 
hypotheses of interest, and can take the value h G {hp,/!^}, according to the prosecution 
or the defence, respectively. Eg and Ec contain the index e<j and Cc of the haplotypes of the 
suspect and of the crime scene, respectively. In the situation of interest Cc = e^. Lastly, 
random vector B represents the database, seen as a multinomial sample from the popula¬ 
tion with parameters N and 6. A particular conhguration of B is denoted b = {bi, ..., bm) 
representing the absolute frequency in the database of each of the m haplotypes. It con¬ 
tains kobs < k positive values, and many zeros. By applying Lemma 4.1 to this situation 
we have that 

_ E(Pr(£'e = eg I Llg = eg, B = b, 0, iL = hp) \ Eg = es,B = b) _1_ 

E(Pr(L;c = es\ Eg = es,B = h,&,H = hd) \Es = es,B = b) E(0e,, | = e^, B = b) ' 

It can be proven that for a = 1 this leads to 



E 

h —fcobsH” 1 


k 


m 


kobs -i-i7r(fe-i-A^-i-i) 


p{k) 


E 

k —fco6s“l” 1 


k 


m 


kobs -\-'^jT{k-\-N-\-2) 


p{k) 


(3) 


Notice that the likelihood ratio depends on the data only through kobs- This is due 
to the choice of the symmetric Dirichlet prior, and of the uniform prior for Type. In 
particular, this tells us that data can be reduced by sufficiency to kobs- The likelihood 
ratio obtained through a classical plug-in Bayesian estimation is: 


LR = 


ka + N 
a + be. 


k + N. 


(4) 


where the number of haplotypes is a fixed value k, to be chosen (or estimated) in advance. 
In order to compare the two values (4) and (3) we need to choose a value for k. A reasonable 
choice can be fc = E(Ar). Among the possible priors one can choose for K, we decided to 
test the Poisson distribution (see Section 6.1) and the negative binomial distribution (see 
Section 6.2). 


6.1 Poisson prior 


In this section a Poisson distribution with parameter A, truncated so as to have support 
only on {1,2, is chosen as prior distribution for K. 


p{k) := p{k] A) oc 


kl 

0 


if G {1,..., m} 
elsewhere 


where A > 0. If A and m are large enough, the normalising constant can be omitted and 
we have the standard Poisson distribution: 
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e~^\^ 

p[k;X) = ——, VfeeN 

k\ 

The LR in (3) becomes 

r{fc) 

y p _ 1 = + l r(7V + fe+l) 

“ 2 V™ Afc r(fc) 

2-^k=ko^s-^\ /c—/cobs~l! r(A^+fc+2) 

It is then of interest to analyse the quantities log^g LR, log^g LR, and the difference 
log;^o LR — logio LR between them and to carry on a sensitivity analysis to see how these 
quantities vary when parameter A changes. 

Sensitivity analysis In the rare type match problem (i.e., bg^ = 0), when a Poisson(A) 
prior is chosen for the dimension K of the Dirichlet distribution (with all parameters a 
equal to 1), the sensitivity of the three quantities log^g LR, logig LR, and of their difference, 
to A and kobs is shown in Figure 8, when N = 100. 




I 


(b); logic LR - logio LR 


Figure 8: Poisson prior, for the Dirichlet model. Sensitivity analysis of logio LR (a, black 
lines), logio LR (a, dashed line), and of the difference logio LR — logio LR (b), to different 
values of A G [1,10 000] (x axis), and kobs £ {30,40,50,60,70,80,90,100} (represented by 
the different lines where highest line corresponds to highest f3). 


In particular, it can be inferred that the LR depends little on kobs and a lot on A. When 
A is big (which is typically true A being the expected value of the number of different 
Y-STR haplotypes in a population) the LR depends almost only on A. In particular, 
LR increases linearly with A, since LR ~ A/2. This can be explained by replacing the 
Poisson prior on k, by the degenerate distribution localised on (the integer part of) A: 
fK{k) = f{k]X) = l{;^}(/c), for A G {1,2,....}. This approximation is sensible for large 
values of A in virtue of the law of large numbers (the Poisson(A) being the sum of A 
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Poisson(l)). In this case (3) becomes 


LR = 


1 + N + X 

- r\j 

2 


A 

2 ’ 


for A —>■ + 00 , and N fix. 


The plug-in estimates of log^g LR (as dehned in (4) and with the choice of = A) is 
the dashed line shown in Figure 8 (a). The difference between the ‘true’ value logigLR, 
and the estimated one log^gLR is shown in Figures 8 (b). In particular, one can see that, 
for big A, it decreases when A increases and depends a little on kobs, while for small values 
of A it has the opposite behaviour, and depends more strongly on kobs- Note that, again, 
the plug-in method overestimates the LR by up to almost half of an order of magnitude. 


6.2 Negative binomial prior 


A different choice is that of using as prior for k the negative binomial distribution (Hill, 
1968, 1979; Lewins and Joanes, 1984) as prior distribution for K. For our model a negative 
binomial distribution truncated so as to have support {1, ...,m} is more appropriate. It is 
defined as: 


Pr(iL = k\r, q) oc 


( ^){l-qYq'’ if A: G {1, ...,m} 


0 elsewhere 

where r > 0 and q G (0,1). However, if E(iL) is large, but small compared to m, the 
normalise factor is almost 1 and the standard negative binomial distribution can be used 
as prior distribution over K\ 


Vi{K = A:|r, q) 



(i-g)V, 


yk G N. 


Using this prior, the likelihood ratio in (3) becomes: 

V™ (^ ''fc r(fc) r(fc+r) 

^ p _ 1 yf r(k+N+i)r(k-k„ts) 

/, r(fc) r(fc+r) 

^A:=fcoi, 3 +lG yj r(k+N+2)r{k-k„t,) 


(5) 


In the following, a series of properties of the (zero truncated) negative binomial distri¬ 
bution will be listed, which will help to understand why this choice is more appropriate 
than the choice of the Poisson distribution as a prior for K. We will denote as NB(r, q) a 
random variable distributed according to a negative binomial with parameters r and g, and 
P(A) a random variable distributed according to a Poisson distribution with parameter A. 


1. The mean and variance of NB(r, g) are, respectively, E(NB(r, g))= (1 — q)r/q and 
Var(NB(r, q))= (1 — q)r/q‘^. This represents an advantage over the use of a Poisson 
distribution where these two quantities can’t be tuned independently one another, 
since E(P(A)) = Var(P(A)) = A. Thus, the use of a negative binomial prior guaran¬ 
tees more flexibility. 

2. The negative binomial NB(r, q) is a Gamma mixture of Poisson. 
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3. For fixed A=E(NB(r, ( 7 )), when r increases, the negative binomial NB(r, g) tends in 
distribution to P(A). This means that the negative binomial distribution can be seen 
as an extension of the Poisson distribution. 

The same properties apply to the [0, m]-truncated case, both for the Negative Binomial, 
and for the Poisson, if m is big enough and the probability of 0 is small. 

6.3 Sensitivity analysis 

A classical approach to sensitivity analysis for the negative binomial prior would be to 
analyse the sensitivity of log^oLR to changes of r and g, and kobs, the three parameters 
appearing in (5). However, we decided to use as parameters r, kobs (the number of different 
haplotypes observed in the database) and A, the mean value of the negative binomial prior. 
In this way it is easier to see how the results depend on the average number of haplotypes 
in Nature, and that for big r we fall back in the Poisson case, as explained in property 
3. Figure 1 represents the sensitivity analysis for logj^g LR, logio LR and the difference 
logio LR — logio LR in the rare type case ( 6 *^ = 0 ) for a = 1 , = 100 . 

It can be inferred from this analysis that when r increases the values depend more and 
more on A and less and less on kobs- 

According to the second column of Figure 1, one can see that for r > 100 the plug-in 
estimate always exceeds logio LR- Anyway, the difference is only significant if r is small, 
in particular for high values of A. 

6.4 Remarks about conventional priors 

As mentioned above, the beta distribution and the Dirichlet distribution are the conven¬ 
tional choices for the prior in case of binomial or multinomial model, respectively. As 
stated in Curran et al. (2002), this ‘remains the accepted standard in some laboratories” 
because of the “appeal of simplicity and ease of implementation”. Although we agree that 
this may have been a very sensible reason some decades ago, nowadays, with the com¬ 
putational skill provided by our computers, there are no more excuses to limit ourselves 
to these convenient priors. Indeed, a prior should reflect the expert beliefs rather than 
standards of computational ease. 

For the beta prior, the dependencies of the LR results on the value of the hyper 
parameter a stresses once more the need of a different choice. Moreover, the model is 
profile-specific, meaning that the beta priors is supposed to model the frequency of the 
profile observed at the crime scene. The model is thus to be defined after the data have 
been observed, and this seems to contradict common Bayesian principles. 

For the Dirichlet prior, we have a similar issue. The dimension k of this prior should 
correspond to the number of different DNA types in the population. This problem, which 
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for autosomal markers could be easily overcome, is more important when Y-STR haplo- 
types are considered, the state space being huge, and the database hardly representative. 
If we choose as k the number of different types observed in the database, then we are in 
trouble every time a new haploytpe is observed, as for the situation of interest for this 
paper. By treating k as a Bayesian would do for an unknown quantity, we expected the 
likelihood ratio to depend a lot on the mean value of the prior chosen for k. The Dirichlet 
method with all parameters a = 1, and with a prior over K, turned out to depend only 
on the number of observed haplotypes in the database (and not on their frequencies). 
This is actually unattractive for Y-STR data, and is due to the symmetry. The data 
does not overrule the prior which makes all the positive pi the same in size, and it is also 
the reason why the likelihood ratios obtained using the two methods (beta-binomial, and 
Dirichlet-multinomial) do not differ too much. Notice that for this prior we only focused 
on the case in which all the parameters a are equal to 1. More could have been done, 
for instance explore the sensitivity of the likelihood ratio to changes in the a (Triggs and 
Curran, 2006), or use hierarchical model (Chen and McGee, 2008). However, we preferred 
to investigate other types of prior (Cereda, 2015c) which we believe are more appropriate 
for Y-STR haplotypes frequencies. 

The two methods of Section 5 and Section 6 differ in the choice of information retained 
from the database. The Beta method only retains as information the frequency of the 
observed haplotype. A lot of information regarding other haplotypes is discarded, such 
as how many have been observed, and their frequencies. Let us point out that if there 
will ever be guidelines on how to choose the hyper parameters of the beta prior and of 
the Dirichlet prior, they should be compatible: hence the beta prior should be the one 
obtained from the Dirichlet by marginalisation. 

7 Conclusion 

This paper is intended to have several take-home messages. The first one is that a forensic 
statistician before starting any evaluation should make up his mind if he wants to use 
frequentist or Bayesian methods, since we have seen that the corresponding likelihood 
ratios are differently defined. If a Bayesian approach is chosen, which has the advantage 
that everything is combined into a single number, without any uncertainty involved, the 
LR should be calculated in a principled way. Bayesian plug-in (and frequentist plug¬ 
in), often proposed as proper Bayesian approach, can sometimes be seen as a convenient 
approximation of the Bayesian LR, but this paper has shown that the full Bayesian method 
is not more difficult. Moreover, the Bayesian plug-in is almost always anti-conservative 
in a way that is unfair to defence, and there are sometimes significant differences with 
the full Bayesian method for particular choices of the hyper-parameters of the priors. 
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All this has been shown when the conventional choices for the priori (beta or Dirichlet) 
are made. The choice of the prior is an issue indeed. We believe that a true Bayesian 
should not make use of conventional priors, but of his own priors. Especially because, as 
shown, this conventional choice leads to likelihood ratios which strongly depend on the 
hyperparameters of these priors. Choosing more realistic prior may increase the difficulty 
of the computation of the likelihood ratio, but, also thanks to modern computational tools, 
this should not stop people from preferring them. 
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